Fix missing pages in search indexes #212
Merged
Fixes #211
I noticed there were significant pages missing from the web search.
After some investigation, I found two circumstances in which entries were missing.
In the first case, some entries were being excluded by the `!$index['chunk']` check. I found these could easily be re-included by skipping this check for specific `element` values. You can see the list of these by running the following query on the `output/index.sqlite` file generated by a phd run:
Additionally, entries were excluded where they all had the same docbook_id, because of the way the `indexes` array is indexed by this id. Because the existing search indexes rely on this id, it's not easy to resolve.
However, the current web search actually reworks these indexes to combine them anyway, and I'd created a pre-combined version of these indexes for #204 that does not rely on indexing by the docbook_id.
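
To illustrate why keying by docbook_id drops entries, here is a minimal sketch. It is not the phd code: the rows, the `name` field, and the ids are hypothetical, and only the `docbook_id` key comes from the description above. Which duplicate survives here (the last one) is a property of this sketch, not a claim about phd.

```php
<?php
// Hypothetical rows for illustration only; the ids and names are made up.
$rows = [
    ['docbook_id' => 'function.example', 'name' => 'example_func'],
    ['docbook_id' => 'function.example', 'name' => 'Example::method'],
    ['docbook_id' => 'wrappers.example', 'name' => 'example://'],
];

// Keyed by docbook_id: a later row with the same id replaces the earlier one.
$keyed = [];
foreach ($rows as $row) {
    $keyed[$row['docbook_id']] = $row;
}

// Plain list: every row is kept, including those sharing a docbook_id.
$list = [];
foreach ($rows as $row) {
    $list[] = $row;
}

echo count($keyed), "\n"; // 2
echo count($list), "\n";  // 3
```

Building the index as a plain list is what lets a pre-combined index carry both entries for the same id.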
I've copied these changes to this PR and then modified them to pull the index list from the database without the deduplication. These will require further changes to php-web `js/search.js` and `js/search-index.php` - I've already made the changes to `search.js` for #204 (and the change to `search-index.php` is just having it spit out the new -combined index file without any of the manipulation it currently does).
You can see the list of affected docbook_ids with the following query:
Generally these are cases where there are both procedural and OOP interfaces. I'd guess which one makes it into the indexes is determined by the order in which the `<refname>` values appear.
There are some other cases, such as stream wrappers (e.g. bzip2:// and zlib://).
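
A query of the following shape would surface docbook_ids that are shared by more than one entry. This is only a sketch against an in-memory SQLite table: the table name `entries`, the `name` column, and the sample rows are assumptions, and it does not reflect the actual query or the real layout of `output/index.sqlite`. It assumes the `pdo_sqlite` extension is available.

```php
<?php
// In-memory stand-in database; the schema is hypothetical apart from docbook_id.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE entries (docbook_id TEXT, name TEXT)');

// Made-up sample rows: two entries share one id, one entry is unique.
$insert = $db->prepare('INSERT INTO entries (docbook_id, name) VALUES (?, ?)');
foreach ([
    ['function.example', 'example_func'],
    ['function.example', 'Example::method'],
    ['wrappers.example', 'example://'],
] as $row) {
    $insert->execute($row);
}

// Group by id and keep only ids that occur more than once.
$sql = 'SELECT docbook_id, COUNT(*) AS n
        FROM entries
        GROUP BY docbook_id
        HAVING COUNT(*) > 1
        ORDER BY docbook_id';
$shared = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);

foreach ($shared as $row) {
    echo $row['docbook_id'], ': ', $row['n'], "\n";
}
```

Against the sample rows this reports only `function.example`, since it is the one id with two entries.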